Main Goal:

Predict whether a particular booking made by a guest is going to be canceled.

Problem Statements:

  1. Where do guests come from?
  2. How much do guests pay for a night?
  3. How does the price per night vary over the year?
  4. Which months are the busiest, i.e. in which months are guest numbers highest?
  5. How long do people stay at the hotels?

Examining the data:

We need to check three columns (adults, children, and babies) to see whether any row has all three values equal to zero at the same time. A booking cannot have zero guests, so such rows are invalid.

There are 180 rows in which adults, children, and babies are all zero at the same time; these need to be excluded from the dataset.
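The check described above can be sketched as follows. This is a minimal illustration on a toy table, not the project's actual code: the column names `adults`, `children`, and `babies` follow the text, while the values are made up.

```python
import pandas as pd

# Toy stand-in for the bookings table; the values are illustrative only.
df = pd.DataFrame({
    "adults":   [2, 0, 1, 0],
    "children": [0, 0, 2, 1],
    "babies":   [0, 0, 0, 0],
})

# A row is invalid when all three guest counts are zero at the same time.
no_guests = (df["adults"] == 0) & (df["children"] == 0) & (df["babies"] == 0)
print("rows without any guest:", int(no_guests.sum()))

# Keep only the rows that have at least one guest.
df_clean = df[~no_guests].reset_index(drop=True)
print(df_clean)
```

On the real dataset the same mask would be expected to flag the 180 rows mentioned above. Missing values in any of the three columns compare as not equal to zero, so they are not removed by this filter and have to be handled separately.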

The City hotel has more guests during spring and autumn, when the prices are also highest. In July and August there are fewer visitors, although prices are lower.

Guest numbers for the Resort hotel go down slightly from June to September, which is also when the prices are highest. Both hotels have the fewest guests during the winter.

NEXT STEP:

>> Finding Correlations

For correlations we use the cleaned dataset, in which only missing values and other abnormalities were taken care of.
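One way to produce such a ranking is to correlate every numerical column with the target and sort by absolute value. The sketch below assumes the target column is called `is_canceled` (a hypothetical name here) and uses a toy table with made-up values.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; the values are illustrative only.
df = pd.DataFrame({
    "is_canceled":               [1, 0, 1, 0, 1, 0],
    "lead_time":                 [200, 10, 150, 5, 300, 20],
    "total_of_special_requests": [0, 2, 0, 1, 0, 3],
    "hotel": ["City", "Resort", "City", "Resort", "City", "City"],
})

# Pearson correlation is only defined for numerical columns.
numeric = df.select_dtypes(include="number")

# Correlation of every numerical feature with the target.
corr_with_target = numeric.corr()["is_canceled"].drop("is_canceled")

# Rank by strength, ignoring the sign of the relationship.
ranking = corr_with_target.abs().sort_values(ascending=False)
print(ranking)
```

Sorting on the absolute value matters because a strongly negative correlation is as informative as a strongly positive one.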

From this list it is apparent that lead_time, total_of_special_requests, required_car_parking_spaces, booking_changes and previous_cancellations are the 5 most important numerical features. However, to predict whether or not a booking will be canceled, the number of booking changes is a possible source of leakage, because this value can still change after the booking is made and is not final at prediction time. We will also not include "days_in_waiting_list" and "arrival_date_year".

The most important feature to exclude is "reservation_status", because it records the final outcome of the booking and would therefore leak the target directly.
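Dropping the excluded columns could look like the sketch below. The column names are taken from the text, except `is_canceled`, which is an assumed name for the target. The toy values are made up.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; the values are illustrative only.
df = pd.DataFrame({
    "is_canceled":          [1, 0, 0],
    "lead_time":            [120, 3, 45],
    "booking_changes":      [0, 1, 0],
    "days_in_waiting_list": [0, 0, 5],
    "arrival_date_year":    [2016, 2017, 2016],
    "reservation_status":   ["Canceled", "Check-Out", "Check-Out"],
})

# Columns left out of the model, as discussed above.
excluded = [
    "reservation_status",
    "booking_changes",
    "days_in_waiting_list",
    "arrival_date_year",
]

features = df.drop(columns=excluded)
print(list(features.columns))
```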

Categorical Features Dataset:

Mean Encoding:

In this stage, we use the "Mean Encoding" technique to take care of the categorical variables. Mean encoding replaces each category with a number: the mean of the "cancellation" column over all rows that belong to that category. It is basically grouping by the categorical column and then calculating the mean of the target within each group.
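A minimal sketch of mean encoding with pandas is shown below. The column `market_segment` is used as an example of a categorical feature and `is_canceled` as the name of the cancellation column. Both names and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the categorical dataset; the values are illustrative only.
df = pd.DataFrame({
    "market_segment": ["Online", "Online", "Direct", "Direct", "Groups", "Online"],
    "is_canceled":    [1, 0, 0, 0, 1, 1],
})

# Group by the category and take the mean of the target within each group.
segment_means = df.groupby("market_segment")["is_canceled"].mean()

# Replace each category with the mean of its group.
df["market_segment_enc"] = df["market_segment"].map(segment_means)
print(df)
```

Because the encoding is computed from the target itself, computing the means on the full dataset lets information from the test rows flow into the features. A common precaution is to compute the group means on the training split only and map them onto the test split.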

Concatenating numerical and categorical datasets:

Handling the outliers:

To normalize the data in these columns, we apply a log transform.
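The log transform can be sketched as below. The text does not say which columns are transformed or which variant of the logarithm is used, so `lead_time` and `adr` are hypothetical choices, and `np.log1p` (the logarithm of 1 + x) is swapped in here only because it stays defined when a value is zero.

```python
import numpy as np
import pandas as pd

# Toy stand-in with strongly skewed columns; the values are illustrative only.
df = pd.DataFrame({
    "lead_time": [0, 9, 99, 999],
    "adr":       [0.0, 50.0, 120.0, 5400.0],
})

# Hypothetical choice of columns to transform.
skewed_cols = ["lead_time", "adr"]

df_log = df.copy()
for col in skewed_cols:
    # log1p(x) = log(1 + x), so zeros map to zero instead of -inf.
    df_log[col] = np.log1p(df_log[col])

print(df_log)
```

The transform compresses large values much more than small ones, which reduces the influence of outliers. It assumes the values are not below -1. Negative values of that size would need to be cleaned first.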

Defining the features:

Feature importance:

Applying machine learning algorithm:

Deploying Models: